Homework 4:

  1. Follow the steps below:
    • Read wine.csv in the data folder.
    • The first column contains the wine category. Don't use it in the models below; we treat this as an unsupervised learning problem and then compare the results to the Wine column.
  2. Try KMeans with n_clusters = 3 and compare the clusters to the Wine column.
  3. Try PCA and see how much you can reduce the variable space.
    • How many components did you need to explain 99% of the variance in this dataset?
    • Plot the PCA variables to see if they bring out the clusters.
  4. Try KMeans and hierarchical clustering on the PCA-transformed data and compare the clusters to the Wine column again.

Dataset

wine.csv is in the data folder under homeworks.


In [105]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix
%matplotlib inline
np.set_printoptions(suppress=True)

In [106]:
wine = pd.read_csv('../data/wine.csv')

In [107]:
wine.tail()


Out[107]:
Wine Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline
173 3 13.71 5.65 2.45 20.5 95 1.68 0.61 0.52 1.06 7.7 0.64 1.74 740
174 3 13.40 3.91 2.48 23.0 102 1.80 0.75 0.43 1.41 7.3 0.70 1.56 750
175 3 13.27 4.28 2.26 20.0 120 1.59 0.69 0.43 1.35 10.2 0.59 1.56 835
176 3 13.17 2.59 2.37 20.0 120 1.65 0.68 0.53 1.46 9.3 0.60 1.62 840
177 3 14.13 4.10 2.74 24.5 96 2.05 0.76 0.56 1.35 9.2 0.61 1.60 560
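
Before clustering, it helps to know how the 178 samples split across the three categories. A quick check (not part of the original run):

In [ ]:
# number of samples in each wine category
print(wine.Wine.value_counts())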

In [108]:
# shift the Wine labels from 1-3 to 0-2 so they are zero-based like KMeans labels
wine.Wine = wine.Wine - 1

In [109]:
y = wine.Wine

In [110]:
# all 13 measurements; the Wine column itself stays out of the models
X = wine.iloc[:, 1:]

In [111]:
kmeans = KMeans(n_clusters=3, random_state=1)
Y_hat_kmeans = kmeans.fit(X).labels_

In [112]:
# Alcohol vs. Malic.acid, colored by cluster, with point size scaled by Mg
plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=Y_hat_kmeans, s=X.iloc[:, 4] * 2)


Out[112]:
<matplotlib.collections.PathCollection at 0x110417e50>

In [113]:
print(confusion_matrix(Y_hat_kmeans, y))
plt.matshow(confusion_matrix(Y_hat_kmeans, y))
plt.title('confusion matrix')
plt.xlabel('actual values')    # columns: the Wine column
plt.ylabel('Y_hat_kmeans')     # rows: KMeans cluster labels
plt.colorbar()


[[46  1  0]
 [ 0 50 19]
 [13 20 29]]
Out[113]:
<matplotlib.colorbar.Colorbar instance at 0x11052bb00>
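
KMeans numbers its clusters arbitrarily, so the rows of this confusion matrix can be permuted relative to the true labels. A permutation-invariant score avoids that; a minimal sketch using sklearn's adjusted_rand_score:

In [ ]:
from sklearn.metrics import adjusted_rand_score

# agreement between cluster labels and the Wine column,
# invariant to how the cluster numbers happen to be assigned
print(adjusted_rand_score(y, Y_hat_kmeans))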

In [114]:
from sklearn.decomposition import PCA
from sklearn import preprocessing

In [115]:
# standardize the features, then sweep the number of components (1-13)
# and record the total explained variance for each choice
X_scale = preprocessing.scale(X)
comp = np.arange(1, 14)
explained_var = []
for i in comp:
    pca = PCA(n_components=i)
    X_pca = pca.fit_transform(X_scale)
    explained_var.append(pca.explained_variance_ratio_.sum())
plt.plot(comp, explained_var)


Out[115]:
[<matplotlib.lines.Line2D at 0x11073a1d0>]

In [116]:
pca.explained_variance_ratio_


Out[116]:
array([ 0.36198848,  0.1920749 ,  0.11123631,  0.0706903 ,  0.06563294,
        0.04935823,  0.04238679,  0.02680749,  0.02222153,  0.01930019,
        0.01736836,  0.01298233,  0.00795215])
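
Reading the 99% threshold off the plot is imprecise; a cumulative sum gives the count directly. From the ratios above, the first 12 of the 13 scaled components are needed. A quick sketch:

In [ ]:
# smallest number of components whose cumulative explained variance reaches 99%
cum_var = np.cumsum(pca.explained_variance_ratio_)
print(np.argmax(cum_var >= 0.99) + 1)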

In [117]:
# repeat the sweep on the raw, unscaled data for comparison
comp = np.arange(13) + 1
explained_var = []
for i in comp:
    pca = PCA(n_components=i)
    X_pca = pca.fit_transform(X)
    explained_var.append(pca.explained_variance_ratio_.sum())
plt.plot(comp, explained_var)


Out[117]:
[<matplotlib.lines.Line2D at 0x1111c0e10>]

In [118]:
print(pca.explained_variance_ratio_)


[ 0.99809123  0.00173592  0.00009496  0.00005022  0.00001236  0.00000846
  0.00000281  0.00000152  0.00000113  0.00000072  0.00000038  0.00000021
  0.00000008]
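
On the unscaled data a single component already explains ~99.8% of the variance. That is a scale artifact rather than structure: Proline's numeric range is far larger than the other features', so it dominates the covariance. A quick check of the per-feature spread:

In [ ]:
# Proline's standard deviation dwarfs every other feature,
# which is why unscaled PCA collapses onto one component
print(X.std())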

In [119]:
# keep 4 components; on the unscaled data this covers well over 99% of the variance
pca = PCA(n_components=4)
X_pca = pca.fit_transform(X)

In [120]:
# refit KMeans (still n_clusters=3) on the PCA scores
Y_hat_kmeans = kmeans.fit(X_pca).labels_

In [121]:
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=Y_hat_kmeans)
plt.colorbar()


Out[121]:
<matplotlib.colorbar.Colorbar instance at 0x111366908>
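
To check whether the projection separates the actual categories rather than just the KMeans clusters, the same scatter can be colored by the Wine column. A quick sketch:

In [ ]:
# same projection, colored by the true Wine category
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.colorbar()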

In [122]:
# compute the pairwise Euclidean distance matrix
# (the kind of input hierarchical clustering works from)
from scipy.spatial.distance import pdist, squareform

distx = squareform(pdist(X_pca, metric='euclidean'))
distx


Out[122]:
array([[   0.        ,   31.21669984,  122.82027877, ...,  230.22891326,
         225.20472279,  506.05806074],
       [  31.21669984,    0.        ,  135.2135927 , ...,  216.21883165,
         211.20932631,  490.23283102],
       [ 122.82027877,  135.2135927 ,    0.        , ...,  350.56698345,
         345.55722715,  625.06802374],
       ..., 
       [ 230.22891326,  216.21883165,  350.56698345, ...,    0.        ,
           5.14098121,  276.07931837],
       [ 225.20472279,  211.20932631,  345.55722715, ...,    5.14098121,
           0.        ,  281.06097274],
       [ 506.05806074,  490.23283102,  625.06802374, ...,  276.07931837,
         281.06097274,    0.        ]])
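
The distance matrix sets up the hierarchical-clustering half of task 4, which the cells above never actually run. A minimal sketch using scipy's linkage and fcluster, assuming Ward linkage and a cut at three clusters to match the three wine categories:

In [ ]:
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# agglomerative clustering on the PCA-transformed data
Z = linkage(X_pca, method='ward')

# cut the tree at three clusters; shift to 0-2 to match the Wine labels
Y_hat_hier = fcluster(Z, t=3, criterion='maxclust') - 1

print(confusion_matrix(Y_hat_hier, y))
dendrogram(Z)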

In [123]:
kmeans = KMeans(n_clusters=3, random_state=1)
Y_hat_kmeans = kmeans.fit(X_pca).labels_

In [124]:
print(confusion_matrix(Y_hat_kmeans, y))
plt.matshow(confusion_matrix(Y_hat_kmeans, y))
plt.title('confusion matrix')
plt.xlabel('actual values')    # columns: the Wine column
plt.ylabel('Y_hat_kmeans')     # rows: KMeans cluster labels
plt.colorbar()


[[46  1  0]
 [ 0 50 19]
 [13 20 29]]
Out[124]:
<matplotlib.colorbar.Colorbar instance at 0x111724b00>
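
Because cluster numbering is arbitrary, a best-case accuracy requires aligning cluster labels to Wine labels first. One way is the Hungarian algorithm via scipy's linear_sum_assignment; a sketch:

In [ ]:
from scipy.optimize import linear_sum_assignment

# find the label permutation that maximizes the confusion-matrix diagonal,
# then report the fraction of matched samples
cm = confusion_matrix(y, Y_hat_kmeans)
rows, cols = linear_sum_assignment(-cm)
print(cm[rows, cols].sum() / float(len(y)))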
